Benchmark Datasets for Graph Layout Algorithms
1 Introduction
Benchmarking is a crucial aspect of computer science, as it allows researchers, developers, and engineers to compare the performance of various systems, algorithms, or hardware. A benchmark is a standardized test, or set of tests, used to measure and compare the performance of hardware, software, or systems under specific conditions. Benchmarking aims to provide objective and consistent metrics that allow for fair comparisons and informed decision-making. Benchmarks are widely used in fields such as computer hardware evaluation, software optimization, and system performance analysis, where they provide a standardized and objective way to compare and assess different systems, algorithms, or software implementations, and aid in deciding which solution best suits a specific use case or requirement.
The same is true for graph drawing, particularly for studying the performance and results of graph layout algorithms (Di Bartolomeo et al. 2024). Benchmark datasets can provide a standardized set of graphs with known properties and characteristics. These graphs can vary in size, density, connectivity, and structure. Researchers can objectively compare their performance or the quality of their results by applying various graph layout algorithms to the same benchmark dataset.
In our own work, we have faced challenges in determining which benchmark datasets to use for evaluating the layout algorithms we developed. This led us to build a collection of benchmark datasets used in previous graph layout algorithm papers, and a Graph Benchmark Datasets website for perusing the collection. We collected 196 papers from Graph Drawing, IEEE venues, and Eurographics venues that include computational evaluations of graph layout algorithms. We then searched for the datasets used for the benchmarks. We collected the data we could find and had permission to archive, and we re-created datasets that were lost but had sufficient replication instructions. We classified the graphs by their features and statistics. We also collected text and images from papers using those graphs.
This paper aims to present this graph drawing benchmark sets resource to the Graph Drawing and visualization communities so that other authors may benefit from our archiving and organization efforts. We hope this resource will improve the discoverability of these datasets and make it easier to run benchmarks for graph layout algorithms. Moreover, as reliable access to datasets is fundamental for replicability, we aim to preserve these datasets in perpetuity. Beyond collecting available datasets and re-creating lost ones, we archived all our materials on OSF for long-term availability. This included saving each graph in multiple common file formats to avoid translation issues for individual authors. We believe our work will lead to more reproducible and replicable Graph Drawing research by providing a long-term and open archive of the data we use in our computational evaluations.
Specifically, we contribute:
- A collection of benchmark datasets used for graph layout algorithm evaluations, including:
- Re-created lost datasets, based on paper descriptions, and a list of unavailable data,
- A classification of these datasets by graph features and statistics to aid in identifying appropriate evaluation candidates,
- Exemplar text and images from papers that use these datasets,
- A website for perusing this collection, and
- A long-term archive of our metadata and the collection to aid in reproducibility and replicability of evaluations.
Please see Section 5 below for our supplemental materials, including links to the website, code and data, OSF archive, and Notion database.
2 Collection process
The information we collected is a by-product of a larger systematic review we conducted on graph layout algorithms, which included 196 papers. The following figure shows the original data collection process (from Di Bartolomeo et al. 2024):
The core of our data collection was the last seven years of Graph Drawing proceedings (264 papers in total), from which we filtered out papers without computational evaluations. We further expanded our collection of graph layout algorithm papers by searching the IEEE Xplore and Wiley digital libraries to include papers from TVCG and CGF. Then, we checked all the citations in the papers we collected from Graph Drawing and added to our collection all papers cited more than 4 times in the last 6 years of Graph Drawing; this ensured we included important algorithm papers that were not published at GD, on IEEE Xplore, or in the Wiley digital library. For each paper, we recorded which features were handled by the graph layout algorithm presented and which dataset was used in the evaluation. When collecting features, we always prioritized the authors' own wording and description of the features. The tagging of the papers was done by two people at the same time, over two different passes, for sanity-checking purposes. Following this process, we tried to track down the datasets used in the computational evaluations: (1) we first looked for official or linked supplemental material, (2) we next Googled the dataset or paper name, and (3) finally, we emailed the authors. When in doubt about the artifact replication policy, we asked the owners or authors by email. In cases where it was explicitly stated that approval should be received before redistribution, we did not redistribute the datasets. However, if we received approval, or if we did not receive an answer and found no explicit policy preventing redistribution, we collected and stored the dataset to preserve it for future researchers. If any dataset owner or author discovers their own work in our collection and would like it removed, we kindly request that they contact us (see Section 7 below), and we will promptly remove it. Furthermore, we want to emphasize that we do not assert any ownership rights over the datasets listed.
The chart below shows the distribution of papers across different venues:
The following one, instead, shows the distribution of collected papers’ publication date:
After collecting the datasets, we examined their contents in more depth, computing a number of statistics for the graphs contained in them.
In the end, we collected 46 datasets, listed below:
The purpose of our collection effort is to facilitate comparisons and replication of results, and to provide a resource for researchers to find datasets that are used in the literature. We also provide links to the specific papers that use each dataset, so that they can be used as examples and sources of information.
3 Datasets in use
The following chart shows how many times we found a dataset being used in the papers we collected. It excludes custom edits to the datasets, which are discussed later in this document.
In the data we collected, the most used dataset is Rome-Lib, followed by assorted collaboration networks (which in many cases refers to datasets of academic collaborations such as dblp or vispubdata). The third most used dataset is from C. Walshaw. It is important to note that the Walshaw dataset is also available as part of other collections; for instance, its graphs are found in SNAP. However, during the collection process, we preferred giving precedence to how the authors reported their own information. Thus, if the authors claimed the data was from the Walshaw collection, we reported it as such.
In the following sections, the reader will find details about the classifications and the datasets. Each dataset gets a dedicated, collapsible section that contains the following information:
- A sparkline visualization, shown in case the dataset contains multiple graphs, illustrating the distribution of graph sizes (in nodes) across the dataset.
3.1 Classification of the Datasets
We classified the datasets into different categories, based on their origins or the number of graphs they contain:
Uniform Benchmark datasets
Uniform Benchmark datasets are standalone, widely used collections of graphs featuring uniform characteristics (usually simple, generic graphs). They are often used in evaluations that run over thousands of graphs to report average completion times, or in other experiments where the reported metrics are aggregated.
The first of these collapsible sections is shown already expanded, to give an example of the content that can be found in each of them. The content is generated dynamically based on the data we collected.
The following collections, together with Rome-Lib, can be easily accessed from the homepage of the Graph Drawing Conference website, and are therefore well known and widely used.
The datasets listed below are particularly useful when researchers are interested in testing for crossing numbers: both contain graphs with known crossing numbers.
The collections presented here are particularly varied in features:
The following two collections are particularly relevant for the display of temporal event sequences:
Collections of graphs that focus on one specific attribute can also be useful, for instance when the attribute relates to geographical data:
Otherwise, one might decide to focus on a particular domain:
And finally, a collection of graph problems:
Established Network Repositories
A popular choice is to use datasets from Established Network Repositories. These are large collections, often organized in dedicated websites that also offer a few statistics about the contained graphs. Because their hosts are already dedicated to maintaining and reporting information on these collections, we do not store the data here (which would be redundant) or report statistics on it. Rather, our analysis focuses on highlighting their properties, origins, and the ways in which they have been used.
Compiled by the Mathematical and Computational Sciences Division of the Information Technology Laboratory of the National Institute of Standards and Technology, the Matrix Market is a repository of test data for use in comparative studies of algorithms for numerical linear algebra, featuring nearly 500 sparse matrices from a variety of applications, as well as matrix generation tools and services. It has been used for experiments with generic graphs, large graphs, and graphs with weighted edges. Each matrix entry is accompanied by metadata that includes the matrix name and identifier, dimensions (number of rows and columns), number of non-zero elements, symmetry type (symmetric, skew-symmetric, or general) and data type (real, complex, integer, or pattern). The website also provides visualizations of the matrices, helping users understand their structure and distribution of non-zero elements. Downloads are also provided in a variety of formats, including their own Matrix Market Exchange (MME) format, Harwell-Boeing, and MATLAB.
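Loading a Matrix Market Exchange (MME) file into a graph library takes only a couple of calls. The following is a minimal Python sketch using scipy and networkx (an assumption of tooling on our part; any language with an MME parser works). The matrix is inlined here to keep the example self-contained, but the same `mmread` call accepts a downloaded `.mtx` file.

```python
import io

import networkx as nx
import scipy.io

# A tiny symmetric "pattern" matrix in MME coordinate format: the header line
# gives rows, columns, and the number of stored (lower-triangular) entries.
mme = """%%MatrixMarket matrix coordinate pattern symmetric
3 3 3
2 1
3 1
3 2
"""

mat = scipy.io.mmread(io.StringIO(mme))     # sparse matrix; symmetry is expanded
graph = nx.from_scipy_sparse_array(mat)     # adjacency matrix -> undirected graph

print(graph.number_of_nodes(), graph.number_of_edges())  # a triangle: 3 nodes, 3 edges
```

For real experiments, one would replace the inlined string with `scipy.io.mmread("downloaded.mtx")`.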
The Network Repository was proposed in 2015 by Rossi and Ahmed of Purdue University as a visually interactive real-world graph tool and database. Graphs in this repository are tagged with their real-world source, come with in-depth analyses, and even include a preview visualization of each individual graph. Types of graphs include social networks, infrastructure networks, biological networks, and many others. The repository also lists sources for individual graphs. It contains many graphs from other benchmark sets described here.
The Pajek Program for Large Network Analysis is a tool developed and hosted by Vladimir Batagelj, Andrej Mrvar, and their colleagues. As part of this program, they later compiled relevant graphs and links to other datasets, which today we call the Pajek collection. As a curiosity, pajek means spider in Slovenian. Many of the Pajek graphs have been included as part of the SuiteSparse Matrix Collection.
The SNAP repository is a collection of datasets assembled as part of the Stanford Network Analysis Platform, which started in 2004 and launched its website in 2009. It is a well-known, widely used graph repository. A number of graphs that became relevant individually are included in SNAP, such as the Enron, Amazon, and Protein Interactions datasets and various social network graphs. SNAP contains generic graphs, directed and undirected graphs, dynamic graphs, and more. Graphs are presented with a name, a general description, a few statistics such as the numbers of nodes and edges, and source and reference information.
Compiled by Donald Knuth, the Stanford Graphbase consists of programs and datasets for network analysis. The datasets accompany a textbook, “The Stanford GraphBase: A Platform for Combinatorial Computing”.
The SuiteSparse Matrix Collection, formerly known as the University of Florida Sparse Matrix Collection, is a comprehensive repository of 2893 sparse matrices. All graphs in SuiteSparse belong to groups, which provide more information about the graphs and the sub-collections they belong to. In our Descriptions from the Literature section, we also highlight a few tables with the specific graphs used in a couple of papers. In "The University of Florida Sparse Matrix Collection", Davis and Hu describe the origin of this network repository: they cite the Harwell-Boeing collection as the starting point for SuiteSparse, then called the University of Florida (UF) Sparse Matrix Collection, back in 1991. Other groups, or collections, have since been added to SuiteSparse through the years, mainly focusing on real-world matrices and other relevant problems related to them.
Single Graphs
A number of papers used individual, Single Graphs for their experiments instead of a collection. These graphs have become popular because of their properties as individual graphs - see, for example, the Enron dataset below, a huge individual graph that has a large variance in node degree distribution. Many of these graphs are also available in other repositories - their locations are noted wherever known.
Aggregate Collections
Many papers used graphs from specific domains that contain particular characteristics (e.g., geographical coordinates are often found in airline data). Instead of collecting each one of these individual, contextual datasets, we aggregated them in further subcategories, which we call Aggregate collections. Individual information about each aggregate collection can be found in the papers that contain them.
Subsets of other collections
Finally, we collected some datasets that we noticed were subsets of existing collections. This is a phenomenon that can happen over the years, through redistribution and the merging of different sources: the Walshaw dataset, for instance, was and still is distributed and cited as its own standalone dataset, but its graphs can be found as part of many other larger collections, such as SNAP. We classified these datasets as Subsets.
Custom-made datasets
In the data we collected, we also found several instances of custom-made datasets. We consider a dataset custom-made either when it is an edit of a pre-existing dataset, which the authors found it necessary to split or modify in a particular way, or when it is built from scratch using random generators or custom-made code. This can happen when the authors of a paper needed a dataset with particular characteristics that was not easy to find in the wild, so a new dataset was crafted.
For instance, consider the case where the authors of a paper develop an algorithm that works on hypergraphs. They want to test that the algorithm works, and test its performance on hypergraphs of various sizes, but datasets containing hypergraphs are difficult to find. For this reason, the authors craft one dataset synthetically, or take a pre-existing dataset and edit it so that it now contains hyperedges.
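As a concrete illustration of the scenario above, the following Python sketch crafts a small synthetic hypergraph dataset from scratch. The generator, sizes, and arity bounds are all hypothetical choices of ours, not taken from any of the collected papers; fixing the random seed is what makes the resulting dataset exactly replicable.

```python
import random

random.seed(42)  # fixed seed so the generated dataset can be replicated exactly


def random_hypergraph(n_nodes, n_hyperedges, max_arity=4):
    """Return a sorted list of distinct hyperedges, each a sorted tuple of 2..max_arity nodes."""
    nodes = range(n_nodes)
    edges = set()
    while len(edges) < n_hyperedges:
        arity = random.randint(2, max_arity)          # how many nodes this hyperedge joins
        edges.add(tuple(sorted(random.sample(nodes, arity))))
    return sorted(edges)


hg = random_hypergraph(n_nodes=20, n_hyperedges=15)
print(len(hg), "hyperedges; first:", hg[0])
```

A paper using such a generator would ideally publish both the code and the seed, which (as discussed below) is what makes the dataset replicable rather than merely reproducible.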
We split custom-made datasets into three categories, with their occurrences in the corpus of papers illustrated below:
Replicable datasets indicate cases where the authors have given enough information that the experiment can be replicated exactly as it was run, or closely enough that the obtained results reflect the published ones. This includes cases where the authors published the entire dataset they used, published the code they used to generate it, or included an exact description of the generation steps.
Reproducible datasets are cases where the authors described the steps they took to generate and/or edit their datasets, but not in enough depth for the exact same graphs to be reproduced, and did not redistribute them. Results can still be reproduced somewhat closely if the authors took care to report enough information about their graphs.
Non-replicable datasets indicate cases where the authors did not distribute their datasets and did not include enough information in the paper for their results to be replicated.
This information is closely tied to the distribution of supplemental material in papers, that is shown in the chart below:
This discussion is part of a larger discourse on research replicability, which is gaining traction in the scientific community. The ACM, for instance, has a policy on artifact review and badging: authors are encouraged to submit their artifacts for review, and if they pass, they receive a badge indicating, for example, that the artifacts are available or have been evaluated. This is a step towards making research more replicable and reproducible, and we hope that our work will contribute to this effort.
See, e.g., ACM’s definitions at https://www.acm.org/publications/policies/artifact-review-and-badging-current.
3.4 Details of the datasets
For each of the datasets collected as part of our process, we conducted a brief analysis of their contents. Where possible, the analysis includes information about the number of nodes per graph, the source of the dataset, which papers have used the dataset and what graph features they took into account. The following section defaults to illustrating the contents of Rome-Lib, but the reader can easily change the dataset in focus by modifying the value of the dropdown menu below:
3.5 Random generation
We discussed above that in a few cases, authors have generated their own datasets to test their algorithms. We call these generated datasets synthetic or custom, as opposed to datasets found in the wild. Generating a synthetic dataset can be essential for several reasons, particularly in the context of algorithm development and validation:
- Privacy and Publication Constraints: Imagine you’ve created an innovative algorithm using a dataset that, due to privacy concerns, cannot be shared publicly. To validate and share your work without compromising data confidentiality, creating a synthetic dataset that mirrors the essential features of the original data allows you to showcase your algorithm’s effectiveness while adhering to privacy restrictions.
- Benchmarking and Comparative Analysis: Suppose you have access to a single dataset that can be shared and wish to prove your algorithm's robustness across various scenarios. Generating synthetic datasets with comparable characteristics provides a controlled environment for benchmark studies. This approach enables you to demonstrate that your algorithm performs consistently well across datasets with analogous features, thereby reinforcing its applicability and reliability.
- Addressing Data Availability Issues: In certain situations, you might design an algorithm to process graphs with specific attributes only to discover the absence of publicly available datasets showcasing these features. Synthetic datasets become invaluable here, allowing you to create tailored data that incorporates the necessary characteristics. This approach not only facilitates the testing of your algorithm in a relevant context but also helps in illustrating its potential in hypothetical yet plausible scenarios.
We also use the word synthetic in cases where the dataset has been altered, sliced, or otherwise modified.
The following is a practical example of a case where you would need to edit a network:
Imagine you are building a visualization to deal with the entirety of the Enron Corpus dataset, which has hundreds of thousands of nodes and edges. Because of its size, you decide to slice the large dataset into many smaller files so you can run tests. This particular dataset, in addition to being a challenging problem because of its size, also has an interesting distribution of connection densities: some nodes are extremely well connected, while others are much less so. Indeed, the dataset comprises a collection of emails sent by Enron executives, either between themselves or to other employees of the company. Because of how the dataset is constructed, taking into account only the emails from the executives, it has a distinctive skew in the connectedness of the data: 158 nodes are extremely well connected, while the rest are much less connected. Because of this, countless mistakes can be made when slicing this graph: for instance, slicing it so that the subset includes a less dense section of the entire graph will fail to provide a representative sample.
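To make the pitfall concrete, the following sketch slices a skewed graph in two ways and compares densities. A small Barabási-Albert graph stands in for the Enron network here (an assumption on our part, since the real data is far larger); the periphery-only slice is clearly not representative of the whole.

```python
import networkx as nx

# Scale-free stand-in for a skewed network: a few hubs, many low-degree nodes.
g = nx.barabasi_albert_graph(500, 2, seed=1)
by_degree = sorted(g.nodes, key=g.degree, reverse=True)

dense_slice = g.subgraph(by_degree[:100])    # slice containing the hubs
sparse_slice = g.subgraph(by_degree[-100:])  # slice of the periphery only

print("full graph density:     ", round(nx.density(g), 4))
print("hub-slice density:      ", round(nx.density(dense_slice), 4))
print("periphery-slice density:", round(nx.density(sparse_slice), 4))
```

The two slices have the same node count yet wildly different densities, so an algorithm benchmarked on one slice tells you little about its behavior on the other.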
While the creation of a synthetic dataset is a perfectly viable way to produce a benchmark set, in addition to replicability criteria (as discussed above), we also have to pay particular attention to a number of statistics related to the structure of graphs — both when generating new datasets, and when slicing up existing datasets to reduce their size, or to create more graphs from a single, larger graph.
The full list of features to take into account to claim that a synthetic graph is comparable to another would be long, and is perhaps beyond the scope of this publication. The following are just a few examples of what could be relevant:
- Size (nodes, edges): The total number of nodes and edges in the graph, indicating its overall scale.
- Diameter: The longest shortest path between any two nodes, showing the graph’s maximum extent.
- Density: The ratio of existing edges to possible edges, reflecting how closely knit the graph is.
- Motifs: Recurring, significant subgraphs; identify which motifs occur more or less frequently than expected.
- Connectedness: Evaluates both the number and sizes of graph components, along with detailed analysis per component.
- Centrality Measures (betweenness, closeness, degree): Quantify the importance of nodes based on their position and connections within the graph.
- Special Cases:
- Layers: Characteristics like number of layers, nodes per layer, and inter-layer connections.
- Disjoint Groups: The count and size distribution of separate clusters, plus an analysis of each group’s properties.
- Overlapping Sets: Number and size distribution of intersecting groups, along with detailed features.
- Additional Node/Edge Data: Information such as timestamps, attributes, or weights that add context to nodes/edges.
- Dynamic: Describes changes in the graph over time, including the nature and frequency of node/edge modifications.
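As a sketch of how such a comparison might be set up, the snippet below computes a few of the statistics listed above with networkx. The graph and its parameters are arbitrary stand-ins; in practice one would run the same function on both the original and the synthetic graph and compare the resulting numbers.

```python
import networkx as nx

g = nx.erdos_renyi_graph(60, 0.08, seed=7)  # placeholder graph under comparison

stats = {
    "nodes": g.number_of_nodes(),
    "edges": g.number_of_edges(),
    "density": nx.density(g),                           # edges / possible edges
    "components": nx.number_connected_components(g),    # connectedness
}

# Diameter is only defined per connected component, so use the giant component.
giant = g.subgraph(max(nx.connected_components(g), key=len))
stats["diameter_of_giant_component"] = nx.diameter(giant)

# Centrality measures are per-node; summarize with the mean for comparison.
stats["mean_betweenness"] = sum(nx.betweenness_centrality(g).values()) / len(g)

for name, value in stats.items():
    print(f"{name}: {value}")
```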
The generation of random graphs that accurately mimic specific features presents a complex challenge that has been explored extensively in the past, but for which no universal solution has been found yet. Two examples of popular models used to generate synthetic datasets are the Erdős–Rényi (ER) and Barabási–Albert (BA) models, each with its own distinct focus and limitations:
The ER model excels in creating graphs with uniform edge distribution, ideal for testing an algorithm’s ability to evenly space nodes and reduce edge crossings. However, it falls short in replicating the complex, non-uniform connections seen in real-world networks, limiting its applicability to scenarios requiring simplicity and randomness. Conversely, the BA model produces scale-free networks with a few highly connected nodes, reflecting the hierarchical structure of many natural and human-made systems. It challenges layout algorithms to effectively display these hubs without cluttering. The limitation here is its focus on growth and preferential attachment, which might not suit all types of networks, particularly those without clear hub structures.
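A minimal sketch of the contrast between the two models, using networkx (the graph sizes and parameters are arbitrary assumptions): the ER graph's maximum degree stays close to its mean degree, while the BA graph grows hubs whose degree is far above the mean.

```python
import networkx as nx

n = 1000
er = nx.erdos_renyi_graph(n, p=4 / n, seed=0)   # uniform: ~4 expected neighbours each
ba = nx.barabasi_albert_graph(n, m=2, seed=0)   # preferential attachment: grows hubs


def max_degree(g):
    return max(d for _, d in g.degree())


print("ER max degree:", max_degree(er))  # close to the mean degree
print("BA max degree:", max_degree(ba))  # far above the mean: a hub
```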
As there is no universal solution for random graph generation, our recommendation is to try as much as possible to pay attention to research replicability criteria, such as redistributing the generated dataset as supplemental material in the paper.
4 Conclusion
We presented a comprehensive benchmark dataset collection for graph layout algorithms. Compiling, organizing, and making a wide array of datasets with diverse characteristics accessible not only facilitates rigorous and fair comparisons of algorithmic performance but also addresses critical issues of replicability and reproducibility in research. The Graph Benchmark Datasets website, along with the efforts for long-term archival, ensures these valuable resources remain available to the community. We aim to help both current and future researchers benefit from a rich repository of data, standardizing the evaluation process for graph layout algorithms, fostering innovation and development within the field, and contributing to the advancement of graph drawing and visualization research. With this work, we hope to underscore the importance of accessible, well-documented benchmark datasets in driving scientific progress and to highlight our commitment to enhancing the integrity and reliability of computational evaluations in the Graph Drawing community.
5 Research Material Statements
The supplemental material for this publication includes:
- The Graph Benchmark Datasets website, hosted on GitHub Pages.
- Code for the website, metadata on the benchmark datasets, and data processing scripts are stored in the gd_benchmark_sets GitHub repository.
- This code and metadata, along with the benchmark datasets themselves in several common graph file formats, are archived for long-term availability in the Benchmark datasets for graph drawing project on OSF.
- The original database used for metadata collection and storage is available on Notion as Benchmark datasets.
6 Acknowledgements
This work was supported in part by the U.S. National Science Foundation (NSF) under award number CAREER IIS-1762268 and the Austrian Science Fund (FWF) [ESP 513-N].
8 License
This work is licensed under a Creative Commons Attribution 4.0 International License. All code is released under the Apache 2.0 License.
9 Conflict of Interest
The authors declare that there are no competing interests.